A Comparative Study on Representation of Web Pages in Automatic Text Categorization
Abstract
With many web sites appearing every day, it has become increasingly difficult to keep web directories up to date and growing. This rapid growth of the World Wide Web has further stimulated interest in applying machine learning to automatic text categorization. We believe that Web page classification differs significantly from traditional text classification because of the additional information provided by the HTML structure and by hyperlinks. In this paper, our objective is to analyze different combinations for representing the training and test documents for an SVM classifier. Our experiments show that representing Web sites with META data and extended inbound anchor text, in addition to the site content, enhances classification performance. Moreover, using expected entropy loss values to weight the term frequencies in the feature vector further improves the performance of the SVM classifier.
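The weighting idea in the abstract can be illustrated with a minimal sketch. It assumes the usual definition of expected entropy loss (the prior class entropy minus the expected class entropy once a term's presence or absence is known, i.e. the information gain of the term) and a simple product of term frequency and that value. The abstract does not give the authors' exact formula or tokenization, so the function names, the field names (`content`, `meta`, `anchor_text`), and the toy data below are all hypothetical.

```python
import math
from collections import Counter


def binary_entropy(p):
    """Entropy in bits of a two-outcome variable with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def represent_page(content, meta, anchor_text):
    """Hypothetical representation: tokens of content, META data and
    inbound anchor text pooled into one bag of lowercase words."""
    return " ".join([content, meta, anchor_text]).lower().split()


def expected_entropy_loss(docs, labels, term):
    """Prior class entropy minus expected posterior entropy given the term.

    docs   -- non-empty list of token lists
    labels -- list of bools, True when the document is in the category
    """
    n = len(docs)
    present = [term in d for d in docs]
    n_present = sum(present)
    n_absent = n - n_present
    n_pos = sum(labels)
    pos_present = sum(1 for p, y in zip(present, labels) if p and y)
    pos_absent = n_pos - pos_present

    prior = binary_entropy(n_pos / n)
    h_present = binary_entropy(pos_present / n_present) if n_present else 0.0
    h_absent = binary_entropy(pos_absent / n_absent) if n_absent else 0.0
    posterior = (n_present / n) * h_present + (n_absent / n) * h_absent
    return prior - posterior


def weighted_vector(tokens, vocabulary, eel):
    """Term frequencies scaled by each term's expected entropy loss
    (assumed weighting scheme, not taken from the paper)."""
    counts = Counter(tokens)
    return [counts[t] * eel[t] for t in vocabulary]


# Toy training set: two pages in the category, two outside it.
docs = [
    represent_page("football score", "sports news", "football"),
    represent_page("football league", "sports", "league table"),
    represent_page("stock market", "finance news", "market"),
    represent_page("market shares", "finance", "shares"),
]
labels = [True, True, False, False]
vocabulary = ["football", "news", "market"]
eel = {t: expected_entropy_loss(docs, labels, t) for t in vocabulary}
vector = weighted_vector(docs[0], vocabulary, eel)
```

In this toy set, a term such as "news" that occurs equally in both classes receives a weight of zero, while a term confined to one class receives the full prior entropy. The resulting vectors could then be passed to any SVM implementation.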
Similar resources
Automated multi-label text categorization with VG-RAM weightless neural networks
In automated multi-label text categorization, an automatic categorization system should output a label set, whose size is unknown a priori, for each document under analysis. Many machine learning techniques have been used for building such automatic text categorization systems. In this paper, we examine virtual generalizing random access memory weightless neural networks (VG-RAM WNN), an effect...
Using neighborhood information for automated categorization of Web pages
In this paper we discuss several issues related to how expanding a Web document's representation influences the quality of topical categorization of Web pages. We consider expanding a Web page with the text content of its linking pages. We show that naive expansion can pull in too much noise and substantially harm categorization results. We present the approach to automated pruning of linking Web...
Arabic text categorization: a comparative study of different representation modes
The quantity of information accessible on the Internet is phenomenal, and its categorization remains one of the most important problems. A lot of work is currently focused on English, rightly so, since it is the dominant language of the Web. However, a need arises for other languages, because the Web is becoming more multilingual every day. The need is much more pressing for the Arabic language. Our research...
Presenting a method for extracting structured domain-dependent information from Farsi Web pages
Extracting structured information about entities from web texts is an important task in web mining, natural language processing, and information extraction. Information extraction is useful in many applications including search engines, question-answering systems, recommender systems, machine translation, etc. An information extraction system aims to identify the entities from the text and extr...
Web Classification Approach Using Reduced Vector Representation Model Based on Html Tags
Automatic web page classification plays an essential role in information retrieval, web mining and web semantics applications. Web pages have special characteristics (such as HTML tags, hyperlinks, etc.) that make their classification different from standard text categorization. Thus, when applied to web data, traditional text classifiers do not usually produce promising results. In this pap...